House Pricing Capstone Project

Author - Dhaval Javia

Problem Statement:

Hello, my name is Dhaval Javia. I am from India and currently work at Infosys. I have been thinking about moving abroad, and the first thing you search for when you arrive is a house to stay in (of course food comes first, but for now let's keep the hunger aside). I want a house on rent that I can afford, so initially price is the main factor for me. After I settle in the country, I can look at buying a house, so I need some sort of system where I can search and compare house rent prices by neighborhood.

Business Model:

By the end of this notebook, we should be able to estimate housing prices for a neighborhood based on features, location, and many other dependent parameters. The main audience is a person who has been living in the state for quite some time and now wants a house of his/her own in a neighborhood of choice.

Data Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

The following data will be used to solve the problem at hand.

  1. Neighborhood data from Foursquare API
  2. Location data from geopy or the Foursquare API.
  3. Pricing dataset with house features.

Data Sources

We have data available from various sources, but I found this houses-on-rent price dataset from the sources below, which required a lot of data-scraping skill, time, and processing power. It also had all the data available, like latitude, longitude, street name, apartment name, rent price, bathrooms, and bedrooms. For extra features, we will use the Foursquare API for nearby popular locations.

  1. Location data and neighborhood data from Foursquare API
  2. Raw dataset creation using web scraping from https://www.torontorentals.com/toronto
  3. Pair Plots used for correlation. https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166

Methodology

Now, first of all, fetching data from the website. It's not an easy task, especially when there are 5 home groups on a single webpage, each with an individual link, and each group has multiple housing options by features and price. Phew!

Now we have ApartmentName, pricing by NoofBedRooms and NoofBathrooms, and location and neighborhood data from the Bing API.

We will first take a look at the acquired data, remove all unnecessary data, and clean some of the columns, as they contain invalid data. For example, the pricing column contains some string data that is of no relevance, and some of the rows do not have neighborhood or location data. Even when that data is present, the API sometimes did not fetch it properly and returned a United States location, which is of no use to us.

So, after cleaning some of the data, we will visualize the dependent and independent variables and see whether those variables have any effect on pricing.

After deciding which variables will be of use to us when modeling, we will group pricing by neighborhood and house features and use KMeans to cluster similar neighborhoods into the same cluster. We will use k = 10 clusters to gain some more flexibility and provide the user with more choice.
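
The grouping-then-clustering step above can be sketched as follows. This is a minimal sketch with made-up toy values, not the notebook's actual data; the real run uses k = 10 on the full scraped dataset, so here k = 2 just fits the toy sample.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the scraped rentals data (hypothetical values).
df = pd.DataFrame({
    'Neighborhood': ['Annex', 'Annex', 'Mimico', 'Mimico', 'Erindale', 'Erindale'],
    'NoofBedrooms': [1, 2, 1, 2, 3, 4],
    'NoofBathrooms': [1, 2, 1, 1, 2, 3],
    'Price': [2000, 2700, 1800, 2200, 2900, 3400],
})

# Average features and price per neighborhood, then cluster the neighborhoods.
grouped = df.groupby('Neighborhood')[['NoofBedrooms', 'NoofBathrooms', 'Price']].mean()
scaled = StandardScaler().fit_transform(grouped)

# k = 10 in the real notebook; only 2 clusters make sense for this toy data.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scaled)
grouped['Cluster'] = kmeans.labels_
print(grouped)
```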

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline
!pip install geocoder
!pip install selenium
!pip install geopy
!pip install pgeocode
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import urllib.request
import time
!pip install geolocator
!pip install folium
import json
import re

from geopy.extra.rate_limiter import RateLimiter
import geopy.geocoders as geocoders
from geopy.geocoders import Nominatim
(pip output trimmed: all requirements already satisfied)

Then again, I didn't want to increase the load on the website, so I saved each response as a webpage and then read it from there.

Below is the code used for fetching the URLs and saving the responses.

In [2]:
# # fetching response for all 260 webpages to build database
# all_web_pages = {}
# for i in range(0,260):
#     url = 'https://www.torontorentals.com/toronto?p=' + str(i+1)
#     print(url)
#     all_web_pages[i] = requests.get(url)
In [3]:
# #saving webpage reponse in a HTML file for future use.
# for i in range(0,260):
#     filename = "./htmls/" + str(i) + '.html'
#     print(filename)
#     with open(filename, "w", encoding="utf-8") as f:
#         f.write(all_web_pages[i].text)

Alright, the tricky part: processing the HTMLs and fetching the data between the SCRIPT HTML tags, using BeautifulSoup to get the content inside each tag and using the URL (inside the tag content) to fetch the various configurations and prices.

Once we have looped all the required data into separate lists, we will create a dataframe and save its content to a CSV. The following code takes a long time to run, as it visits all the URLs and extracts the required data; it took about 2 hours to fetch and format the data for all 3000 rows. So we have commented out the code below to prevent it from executing again.

We can save the data to a dataframe and then export it to CSV. We will use that CSV later when rerunning the project.

In [4]:
# from os import listdir

## listing down columns for the dataframe and creating lists to save data in case the code didn't run properly.
## We can use those lists to append data to dataframe and export it.
# columns = ['ApartmentName','latitude','longitude', 'NoofBedrooms','NoofBathrooms']
# apartname = []
# latitude1 = []
# longi = []
# bedrms = []
# bathrms = []
# priceL = []
# streetnames = []

# #iterating through each html for content
# j = 0
# for each in listdir("./htmls/"):

#     filename = './htmls/' + each
#     with open(filename, "r", encoding="utf-8") as f:
#         response1 = f.read()                 #saving response of each request to response1 var
        
#     #using soap to find script tag in html content
#     soap1 = BeautifulSoup(response1,"html.parser")
#     #soap1
#     script_dump = soap1.find_all("script")
    
#     #3 to -6 is defined after testing and found that all the necessary elements are in range. Although -1 contains all location,
#     # and other url infos which we could have used, but Meh. Ma Project, Ma rules!!!.
    
#     #iterating through each script html tag
#     for group in script_dump[3:-6]:
#         test1 = group
#         test2 = BeautifulSoup(test1.text,"html.parser")
        
#         #saving all necessary things like name, location, streetname, price, apartment type blah blah.
#         newDictionary=json.loads(test2.text)
#         NameofApartment = newDictionary['name']
#         streetAddress = newDictionary['address']['streetAddress']
#         print('---------------ApartMent Name:- ',NameofApartment,"------------------")
        
#         #getting url for the group and fetching more data from there like price, apartment type, no of bathrooms, etc.
#         response2 = requests.get(newDictionary['url'])
#         sub_soap = BeautifulSoup(response2.text,"html.parser")
#         sub_soap
#         table = sub_soap.find_all('td')
#         price_lst = sub_soap.find_all('td',{"class": "price"})
#         beds_lst = sub_soap.find_all('td',{"class": "beds"})
#         baths_lst = sub_soap.find_all('td',{"class": "baths"})
#         print(len(price_lst))
#         #print("Table:-- ", table)
#         #print((len(table)//5)*5)
#         #print(table)
#         for i in range(0,len(price_lst)):
            
# #             beds = BeautifulSoup((table[i+1]).text,"html.parser")
# #             bath = BeautifulSoup((table[i+2]).text,"html.parser")
# #             price = BeautifulSoup((table[i+3]).text,"html.parser")
# #             #price = str(price).split('\n')[1]
# #             print(newDictionary['url'])
#             beds = beds_lst[i].text.replace('\n','')
#             bath = baths_lst[i].text.replace('\n','')
#             price = price_lst[i].text.replace('\n','')
#             print("No of bedrooms are {}, No of bathrooms are {}, Price is {}".format(beds,bath,price))
#             apartname.append(NameofApartment)
#             streetnames.append(streetAddress)
#             priceL.append(price)
#             bedrms.append(beds)
#             bathrms.append(bath)
#             #latitude1.append(latitude)
#             #longi.append(longitude)
            
#             #saving all data in CSV and dataframe
#             #print({'ApartmentName':NameofApartment,'latitude':latitude,'longitude':longitude,'NoofBedrooms':beds,'NoofBathrooms':bath,'Price':price})
In [5]:
# #appending data to dataframe and saving it in a file.
# type(priceL)
# columns = ['ApartmentName', 'Streetname' , 'NoofBedrooms','NoofBathrooms', 'Price']
# print(columns)
# df = pd.DataFrame(columns=columns)
# df['Price'] = priceL
# df['ApartmentName'] = apartname
# df['NoofBathrooms'] = bathrms
# df['NoofBedrooms'] = bedrms
# #df['latitude'] = latitude1
# #df['longitude'] = longi
# df['Streetname'] = streetnames
# #df = df.append({'ApartmentName':NameofApartment,'latitude':latitude,'longitude':longitude,'NoofBedrooms':beds,'NoofBathrooms':bath,'Price':price},ignore_index=True)
# df.head()
In [6]:
#df.to_csv('./raw_df.csv')

Fetching saved data from CSV on IBM Cloud.

In [7]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_43833077c8684b8d83c91e7fabcbb244 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='-d5StrdvYziXsUkU1hMpR9YJzwoeoLUjiTLn2INALVeq',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_43833077c8684b8d83c91e7fabcbb244.get_object(Bucket='ibmdscapstoneproject-donotdelete-pr-345qkkpanezvxh',Key='raw_df.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head()
Out[7]:
Unnamed: 0 ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
0 0 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,000 43.66018 -79.50995 Kingsway South
1 1 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,300 43.66018 -79.50995 Kingsway South
2 2 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $2,700 43.66018 -79.50995 Kingsway South
3 3 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $3,850 43.66018 -79.50995 Kingsway South
4 4 eCentral 15 Roehampton Ave 1.0 1.0 $2,002.5 43.70794 -79.39786 Mt Pleasant West

Reading the CSV we saved in the previous step and removing the unnamed index column.

In [8]:
#str(price).split('\n')[1]
#df = pd.read_csv('./raw_df.csv',index_col=None)
df.drop('Unnamed: 0',inplace=True,axis=1) #dropping unnamed column from df
df.reset_index(drop=True, inplace=True)

df.head()
Out[8]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
0 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,000 43.66018 -79.50995 Kingsway South
1 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,300 43.66018 -79.50995 Kingsway South
2 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $2,700 43.66018 -79.50995 Kingsway South
3 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $3,850 43.66018 -79.50995 Kingsway South
4 eCentral 15 Roehampton Ave 1.0 1.0 $2,002.5 43.70794 -79.39786 Mt Pleasant West
In [9]:
df.shape
Out[9]:
(2705, 8)

Finding the neighborhood based on Streetaddress and, for failed lookups, appending the street to a list for later use. After executing this code, it was found that some neighborhood data was incorrect; the next step was used to fix some of those and to find data for the NA neighborhoods.

In [10]:
#url = 'http://dev.virtualearth.net/REST/v1/Locations/CA/-/-/-/{}?&includeNeighborhood=1&key=AutK2PfFsISHtjUQr00rl2Kf_5tlpgYPtZzJUBZnwl8_NyIOydqyWW91RS4N7NQQ'
# df['Neighborhood'] = ''
# ignored = []
# for Streetname in df.Streetname.unique():
#     try:
#         url = 'http://dev.virtualearth.net/REST/v1/Locations/CA/-/-/-/{}?&includeNeighborhood=1&key=AutK2PfFsISHtjUQr00rl2Kf_5tlpgYPtZzJUBZnwl8_NyIOydqyWW91RS4N7NQQ'
#         url = url.format(Streetname.split(' | ')[0].replace(' ','%20'))
#         response = requests.get(url)
#         response = response.json()
#         print(response['resourceSets'][0]['resources'][0]['address']['neighborhood'])
#         print(response['resourceSets'][0]['resources'][0]['point']['coordinates'][0])
#         print(response['resourceSets'][0]['resources'][0]['point']['coordinates'][1])
#         print('')
#         df['Neighborhood'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['address']['neighborhood']
#         df['Latitude'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['point']['coordinates'][0]
#         df['longitude'][df['Streetname'] == Streetname] = response['resourceSets'][0]['resources'][0]['point']['coordinates'][1]
#     except:
#         ignored.append(Streetname)
    

Finding data for NA neighborhoods via the Bing API.

In [11]:
# responses = []
# for each in df[df.Latitude.isnull()].Streetname.unique():
#     print(each.split(' | ')[0])
#     try:
#         url = 'http://dev.virtualearth.net/REST/v1/Locations/CA/-/-/-/{}?&includeNeighborhood=1&key=AutK2PfFsISHtjUQr00rl2Kf_5tlpgYPtZzJUBZnwl8_NyIOydqyWW91RS4N7NQQ'

#         url = url.format((each.split(' | ')[0]).replace(' ','%20'))
#         response = requests.get(url)
#         response = response.json()
#         latitude = response['resourceSets'][0]['resources'][0]['point']['coordinates'][0]
#         neighborhood = response['resourceSets'][0]['resources'][0]['address']['locality']
#         longitude = response['resourceSets'][0]['resources'][0]['point']['coordinates'][1]

#         df['Neighborhood'][df['Streetname'] == each] = neighborhood
#         df['Latitude'][df['Streetname'] == each] = latitude
#         df['longitude'][df['Streetname'] == each] = longitude
#     except:
#         responses.append(response)

# #df['Neighborhood'][df['Streetname'] == Streetname] = a

Dropping rows if any column has NA value(s).

In [12]:
# df.dropna(axis=0,inplace=True)
# df.head()
# df.to_csv('./raw_df.csv')

OK, now our data looks like this.

In [13]:
# @hidden_cell
# response = requests.get('https://www.torontorentals.com/toronto')
# soap = BeautifulSoup(response.text,"html.parser")
# soap1 = soap.find_all('script')

# soap1 = soap1[-1]
# soap1.text

# import re
# f1 = re.sub("\n    \n      \n        \n        \n        \n\n        ",'', soap1.text)
# f1 = re.sub("\n    const markers = \[\];","",f1)
# f1 = re.sub("\n        ","",f1)
# f1 = re.sub("\n       ",'',f1)
# f1 = re.sub('   ','',f1)
# f1 = re.sub('\n\n','',f1)
# f1 = re.sub('markers.push\(','',f1)
# f1 = re.sub('\n','',f1)
# f1 = re.sub('\)','',f1)
# f2 = re.split(r'\;',f1)

# street_loc_dict = {}

# for i in range(0,len(f2)):
#     try:
#         a = f2[i]
#         #print(a)
#         streetname = re.sub("\'","",re.sub("\"","",re.findall(r'street:\s(.*?),',a)[0]))
#         name = re.sub("\'","",re.sub("\"","",re.findall(r'name:\s(.*?),',a)[0]))
#         lat = re.sub("\'","",re.sub("\"","",re.findall(r'lat: (.*?),',a)[0]))
#         lng = re.sub("(.*?)\,lng: ","",re.findall(r'lat: (.*?)\}',a)[0])
#         street_loc_dict[streetname] = [lat,lng]
#         print(streetname)
#         print(street_loc_dict[streetname])
#     except:
#         pass
In [14]:
# @hidden_cell
# tdf = pd.DataFrame(street_loc_dict.items())
# tdf.rename(columns = {0:'Streetname',1:'location'},inplace=True)
# tdf.head()
# tdf = pd.DataFrame(data=tdf.location.to_list(),index=tdf.Streetname)
# tdf.rename(columns = {0:'Latitude',1:'longitude'},inplace=True)
# tdf.reset_index(inplace=True)

# #tdf[tdf['Streetname'] == '15 Roehampton Ave']

# tdf.to_csv('./test.csv')

# # df_new1 = pd.merge(df_new,tdf,on='Streetname',how='outer')
# # # # df_new1.drop(['Latitude_x','Longitude'],inplace=True,axis = 1)
# # # # df_new1.rename(columns={'Latitude_y':'Latitude'},inplace=True)
# # df_new1.head()
In [15]:
# @hidden_cell
# response = requests.get('https://www.torontorentals.com/toronto')
# soap = BeautifulSoup(response.text,"html.parser")
# script_dump = soap.find_all('script')
# #script_dump[-1]
In [16]:
# @hidden_cell
# f1 = re.sub("\n    \n      \n        \n        \n        \n\n        ",'', script_dump[-1].text)
# f1 = re.sub("\n    const markers = \[\];","",f1)
# f1 = re.sub("\n        ","",f1)
# f1 = re.sub("\n       ",'',f1)
# f1 = re.sub('   ','',f1)
# f1 = re.sub('\n\n','',f1)
# f1 = re.sub('markers.push\(','',f1)
# f1 = re.sub('\n','',f1)
# f1 = re.sub('\)','',f1)
# f2 = re.split(r'\;',f1)
# name_of_apt_dict = {}

# for i in range(0,len(f2)):
#     try:
#         name = re.sub("\'","",re.sub("\"","",(re.findall(r'name:\s(.*?),',f2[i])[0])))
#         streetname = re.sub("\'","",re.sub("\"","",(re.findall(r'street:\s(.*?),',f2[i])[0])))
#         name_of_apt_dict[str(name)] = str(streetname.split('|')[0])
# #         print("name of apartment:- {}".format(name))
# #         print('streetname:- {}'.format(streetname))
# #         print(i)
#     except:
#         pass
    
In [ ]:
 

Data Exploration and Cleaning

In [17]:
df.head()
Out[17]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
0 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,000 43.66018 -79.50995 Kingsway South
1 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 $2,300 43.66018 -79.50995 Kingsway South
2 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $2,700 43.66018 -79.50995 Kingsway South
3 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 $3,850 43.66018 -79.50995 Kingsway South
4 eCentral 15 Roehampton Ave 1.0 1.0 $2,002.5 43.70794 -79.39786 Mt Pleasant West

Let's see what we are dealing with. (Even though we extracted the data via web scraping and the Bing API, so we already know the data, but whatever!)

  1. We have ApartmentName, apartment configurations like NoofBedrooms and NoofBathrooms, Price, and other parameters like Streetname, location coordinates, and neighborhoods extracted from Streetname.

  2. Now, what is the most important data here? As per my understanding, it is neighborhood and average pricing based on features. We will see how this can be used as we progress further.

  3. The dependent variable (Price) increases with bedrooms and bathrooms (with a stronger effect from the bedrooms variable). We will confirm this relationship when we look at correlations.
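
The bedrooms/bathrooms effect in point 3 can be checked numerically. This is a minimal sketch on made-up values, not the notebook's data; on the real dataframe, `df.corr()` (or the pair plots linked in the data sources) gives the same picture.

```python
import pandas as pd

# Toy stand-in for the cleaned rentals data (hypothetical values).
df = pd.DataFrame({
    'NoofBedrooms':  [1, 1, 2, 2, 3, 4],
    'NoofBathrooms': [1, 1, 2, 2, 2, 3],
    'Price':         [2000, 2300, 2700, 3850, 4200, 5500],
})

# Pairwise Pearson correlations; sns.pairplot(df) is the visual equivalent.
corr = df.corr()
print(corr['Price'])
```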

In [18]:
import folium
toronto_map = folium.Map(location=[43.6532,-79.3832],zoom_start=12)
toronto_map.fit_bounds([[43.581028, -79.542861],[43.855465, -79.170700]])
toronto_map
Out[18]:

Initially I plotted the map using all of the data points and found that many points were outliers (i.e., not within Toronto boundaries). So, we will clean the data and format Price and the other variables now.
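
One way to drop the out-of-Toronto points is a simple bounding-box filter. This is a hedged sketch on toy rows; the box coordinates are an assumption taken from the `fit_bounds` call above, not an official city boundary.

```python
import pandas as pd

# Approximate Toronto bounding box (assumption, matching fit_bounds above).
LAT_MIN, LAT_MAX = 43.581028, 43.855465
LON_MIN, LON_MAX = -79.542861, -79.170700

# Toy rows: one inside the box, one far outside (hypothetical values).
df = pd.DataFrame({
    'ApartmentName': ['In town', 'Way off'],
    'Latitude': [43.66018, 40.70000],
    'longitude': [-79.50995, -74.00000],
})

# Keep only rows whose coordinates fall inside the box.
mask = df['Latitude'].between(LAT_MIN, LAT_MAX) & df['longitude'].between(LON_MIN, LON_MAX)
df_toronto = df[mask]
print(df_toronto)
```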

In [19]:
# for lat,lon, street, apt in zip(df['Latitude'], df['longitude'], df['Streetname'], df['ApartmentName']):
#     #print("Borough - {}, Neighbourhood - {}, latitude - {}, longitude - {}".format(borough,neighbourhood,lat,lon))
#     try:
#         label = folium.Popup(apt,parse_html=True)

#         folium.CircleMarker(
#             location=[lat,lon],
#             color = 'red',
#             radius = 5,
#             popup=label,
#             parse_html = True).add_to(toronto_map)
#     except:
#         pass
# display(toronto_map)

Checking Dtype of every variable in our data.

In [20]:
df.dtypes
Out[20]:
ApartmentName     object
Streetname        object
NoofBedrooms     float64
NoofBathrooms    float64
Price             object
Latitude         float64
longitude        float64
Neighborhood      object
dtype: object

Ah, see... I knew it. The data type of Price is object here, and what we want is float. Also, if you look at the Price variable, it has a $ sign and a , separator, which we do not want. So we will go ahead and clean it.

In [21]:
df.replace({'Price':'\$'},{'Price':''},regex=True,inplace=True)
In [22]:
df.replace({'Price':','},{'Price':''},regex=True,inplace=True)
In [23]:
df['Price'].astype('float64')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-9a6d98b54771> in <module>
----> 1 df['Price'].astype('float64')

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5689             # else, only a single dtype is given
   5690             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5691                                          **kwargs)
   5692             return self._constructor(new_data).__finalize__(self)
   5693 

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    529 
    530     def astype(self, dtype, **kwargs):
--> 531         return self.apply('astype', dtype=dtype, **kwargs)
    532 
    533     def convert(self, **kwargs):

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    393                                             copy=align_copy)
    394 
--> 395             applied = getattr(b, f)(**kwargs)
    396             result_blocks = _extend_blocks(applied, result_blocks)
    397 

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    532     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    533         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 534                             **kwargs)
    535 
    536     def _astype(self, dtype, copy=False, errors='raise', values=None,

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    631 
    632                     # _astype_nansafe works fine with 1-d only
--> 633                     values = astype_nansafe(values.ravel(), dtype, copy=True)
    634 
    635                 # TODO(extension)

/opt/conda/envs/Python36/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    700     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
    701         # Explicit copy, or required since NumPy can't view from / to object.
--> 702         return arr.astype(dtype, copy=True)
    703 
    704     return arr.view(dtype)

ValueError: could not convert string to float: 'Inquire'

Huh, our data contains some strings like 'Inquire'. I did not see that one coming! Oh well, we will delete those rows.

In [27]:
df = df[df['Price'] != 'Inquire']
In [28]:
df['Price'] = df.Price.astype('float64')
In [29]:
df.dtypes
Out[29]:
ApartmentName     object
Streetname        object
NoofBedrooms     float64
NoofBathrooms    float64
Price            float64
Latitude         float64
longitude        float64
Neighborhood      object
dtype: object
In [30]:
df.Price.describe()
Out[30]:
count     2689.000000
mean      2530.569394
std       1054.788000
min          1.000000
25%       2000.000000
50%       2350.000000
75%       2780.000000
max      19500.000000
Name: Price, dtype: float64

This min value of 1 seems to be incorrect. (Not seems to be, it is incorrect. Who rents a house for $1?) Anyway, we will go ahead and remove this value and see what the min is after that.

In [31]:
df[df['Price'] == 1]
Out[31]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
2685 972 Bathurst St 972 Bathurst St | Unit: 1 1.0 1.0 1.0 43.66977 -79.4132 Annex
2686 972 Bathurst St 972 Bathurst St | Unit: 2 1.0 1.0 1.0 43.66977 -79.4132 Annex

Oh, just 2 values. Seems like some sort of error in data gathering. Anyway, we can remove them rather than go back and re-fetch the values.

In [32]:
df = df[df['Price'] != 1]
In [33]:
df.Price.describe()
Out[33]:
count     2687.000000
mean      2532.452214
std       1052.918851
min        575.000000
25%       2000.000000
50%       2350.000000
75%       2784.000000
max      19500.000000
Name: Price, dtype: float64

Hmm, something is still wrong. Oh wait, the max rent price. 19,500!? Are you kidding me? What, does Harvey Specter from Suits live here?

In [34]:
df[df['Price'] == 19500]
Out[34]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
198 101 Peter St 101 Peter St | Unit: 602 1.0 1.0 19500.0 43.64751 -79.39273 Waterfront Communities-the Island

We will remove it, of course. Another data-insertion error, it seems.

In [35]:
df = df[df['Price'] != 19500]
In [36]:
df.Price.describe()
Out[36]:
count     2686.000000
mean      2526.135182
std       1000.892915
min        575.000000
25%       2000.000000
50%       2350.000000
75%       2780.000000
max      15000.000000
Name: Price, dtype: float64

We will plot graphs to see the outliers.

In [37]:
sns.distplot(df['Price'])
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe47f57be80>

We can definitely see some Suits-type folks living in wealthy apartments with rents of more than 8k per month. Let's see how many houses are above 8k rent.

In [38]:
df[df['Price'] > 8000]
Out[38]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
245 Ian Serota 183 Wellington St W 2.5 2.5 10000.0 43.64541 -79.38732 Waterfront Communities-the Island
370 177 Lyndhurst Ave 177 Lyndhurst Ave 4.5 4.5 9800.0 43.68270 -79.41466 Casa Loma
522 68 Merton St 68 Merton St 4.0 4.0 8500.0 43.69683 -79.39395 Mt Pleasant West
620 1 Bedford Rd 1 Bedford Rd | Unit: 1802 2.5 2.5 10000.0 43.66853 -79.39713 Annex
706 311 Bay St 311 Bay St | Unit: 4805 1.0 1.0 10500.0 43.64977 -79.38040 Bay Street Corridor
707 63 St Mary St 63 St Mary St | Unit: Th01 3.0 3.0 10000.0 43.66706 -79.38889 Bay Street Corridor
1084 2095 Lake Shore Blvd W 2095 Lake Shore Blvd W | Unit: 617 1.0 1.0 15000.0 43.62943 -79.47728 Mimico
1165 11 Garnet Ave 11 Garnet Ave 4.5 4.5 10500.0 43.66890 -79.42144 Dovercourt-Wallace Emerson-Junction
1693 2350 Doulton Dr 2350 Doulton Dr 4.5 4.5 9000.0 43.54225 -79.64660 Erindale
1853 17 Mcclinchy Ave 17 Mcclinchy Ave 4.5 4.5 8800.0 43.65776 -79.51323 Kingsway South
2228 63 St Mary St 63 St Mary St 3.0 3.0 9000.0 43.66706 -79.38889 Bay Street Corridor
2440 St Regis 311 Bay St | Unit: 04 2.0 2.0 9500.0 43.64977 -79.38040 Bay Street Corridor
2620 386 Yonge Street, Toronto, ON, Canada 386 Yonge Street, Toronto, ON, Canada 2.0 2.0 8500.0 43.65936 -79.38250 Bay Street Corridor

Hmm, two things to notice here:

  1. There are only a few apartments with rent above 8k (few compared to the roughly 3k rows in our dataset). We can remove these rows without much impact on the analysis.
  2. See the last row? Observe the apartment name: we forgot to clean that column. We will do that later. It is just a name and won't affect the results, but it should look tidy, not like garbage.
In [39]:
df = df[df['Price'] <= 8000]
In [40]:
df.Price.describe()
Out[40]:
count    2673.000000
mean     2490.123120
std       852.060509
min       575.000000
25%      2000.000000
50%      2350.000000
75%      2750.000000
max      8000.000000
Name: Price, dtype: float64
In [41]:
sns.distplot(df['Price'])
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe47f55d240>

Ah, much better. Now, if you google enough and know a thing or two about statistics, you will know about skewness. There are two kinds: positive and negative.

Positively skewed data: if the tail is on the right, as in the second image of the figure, the data is right-skewed, also called positively skewed. Common transformations for this data are the square root, cube root, and logarithm.

  1. Cube root transformation: converts x to x^(1/3). This is a fairly strong transformation with a substantial effect on distribution shape, though weaker than the logarithm. It can also be applied to zero and negative values.
  2. Square root transformation: applies to non-negative values only, so check the column's values before applying it.
  3. Logarithm transformation: x to log base 10 of x, log base e of x (ln x), or log base 2 of x. This is a strong transformation often used to reduce right skewness; it applies to strictly positive values only.

Negatively skewed data: if the tail is on the left, the data is left-skewed, also called negatively skewed. Common transformations include the square and cube.

  1. Square transformation: the square, x to x², has a moderate effect on distribution shape and can be used to reduce left skewness.

Another method of handling skewness is finding outliers and possibly removing them.
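The outlier route can be sketched with Tukey's IQR fences, a standard rule for flagging extreme values. This is a minimal, hypothetical helper (the `iqr_outliers` name and the toy prices are mine, not from the notebook):

```python
import pandas as pd

def iqr_outliers(column, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = column.quantile(0.25), column.quantile(0.75)
    iqr = q3 - q1
    return (column < q1 - k * iqr) | (column > q3 + k * iqr)

# Toy example: 9000 sits far above the bulk of the rents.
prices = pd.Series([1800, 2000, 2200, 2400, 2600, 9000])
print(prices[iqr_outliers(prices)].tolist())  # → [9000]
```

With k = 1.5 this mirrors the boxplot whisker rule; a larger k flags only the most extreme rows.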

Here, our data is positively skewed. We can probably use a log transform to fix it. Let's find out whether that works.
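Before trying the transforms below, it helps to see why min-max normalization cannot help: it is a linear rescaling, so it leaves the distribution's shape (and its skewness) untouched, while log and cube-root actually compress the right tail. A quick sketch on synthetic right-skewed data (the log-normal sample here is only a stand-in for `df.Price`, not the real dataset):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(7.8, 0.4, 1000)))  # synthetic right-skewed "rents"

# Min-max normalization is linear, so it preserves skewness exactly.
normed = (prices - prices.min()) / (prices.max() - prices.min())

for name, t in [("raw", prices), ("min-max", normed),
                ("log", np.log(prices)), ("cube root", np.power(prices, 1 / 3))]:
    print(f"{name:>9}: skew = {skew(t):+.3f}")
```

The min-max row prints the same skew as the raw data; the log and cube-root rows come out much closer to zero.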

In [42]:
def normalize(column):
    # Min-max scaling: rescale the column linearly to the [0, 1] range.
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y
In [43]:
price_norm = normalize(df.Price)
sns.distplot(price_norm,fit=norm)
fig = plt.figure()
res = stats.probplot(df['Price'], plot=plt)
In [44]:
sns.distplot(np.log(df.Price),fit=norm)
fig = plt.figure()
res = stats.probplot(np.log(df['Price']), plot=plt)
In [45]:
sns.distplot(np.log10(df.Price),fit=norm)
fig = plt.figure()
res = stats.probplot(np.log10(df['Price']), plot=plt)
In [46]:
sns.distplot(np.power(df.Price,1/3),fit=norm)
fig = plt.figure()
res = stats.probplot(np.power(df['Price'],1/3), plot=plt)

Ah, that looks perfect! Isn't it a beauty?

We tried log, log10, and min-max normalization of the data, but those did not fix the shape. What worked for us was the cube root of the data (it often works well for highly skewed data).

Let's look at the correlation of the variables with each other.

In [47]:
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

Ah, as we predicted, NoofBathrooms and NoofBedrooms are highly correlated with Price.

Heatmap and correlation of variables via graphical visualization

In [48]:
# Price correlation matrix
k = 5 #number of variables for heatmap
cols = corrmat.nlargest(k, 'Price')['Price'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

A pairplot is one of the best ways to visualize the correlation between variables, whether between the dependent and an independent variable or between independent variables.

In [49]:
sns.set()
cols = ['Price', 'NoofBedrooms', 'NoofBathrooms']
sns.pairplot(df[cols], size = 2.5)
plt.show();
/opt/conda/envs/Python36/lib/python3.6/site-packages/seaborn/axisgrid.py:2065: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

Checking for missing data

In [50]:
#missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Out[50]:
Total Percent
Neighborhood 0 0.0
longitude 0 0.0
Latitude 0 0.0
Price 0 0.0
NoofBathrooms 0 0.0
NoofBedrooms 0 0.0
Streetname 0 0.0
ApartmentName 0 0.0

Great! No missing data, as it was already taken care of earlier.

In [51]:
df.head()
Out[51]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
0 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 2000.0 43.66018 -79.50995 Kingsway South
1 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 2300.0 43.66018 -79.50995 Kingsway South
2 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 2700.0 43.66018 -79.50995 Kingsway South
3 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 3850.0 43.66018 -79.50995 Kingsway South
4 eCentral 15 Roehampton Ave 1.0 1.0 2002.5 43.70794 -79.39786 Mt Pleasant West

Filtering the data to a bounding box of map points; rows falling outside it will be omitted.
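The chained comparisons applied in the next cells can also be wrapped in a small reusable helper. This is just a sketch (the `filter_bbox` name and the toy frame are mine), equivalent to the row-by-row filters below:

```python
import pandas as pd

def filter_bbox(frame, lat_min, lat_max, lon_min, lon_max,
                lat_col="Latitude", lon_col="longitude"):
    """Keep only rows whose coordinates fall inside the bounding box (inclusive)."""
    mask = (frame[lat_col].between(lat_min, lat_max)
            & frame[lon_col].between(lon_min, lon_max))
    return frame[mask]

# Toy example: the second row falls outside the Toronto-ish box and is dropped.
toy = pd.DataFrame({"Latitude": [43.65, 41.00], "longitude": [-79.40, -79.40]})
print(len(filter_bbox(toy, 43, 44, -80, -79)))  # → 1
```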

In [52]:
df = df[df['Latitude'] <= 44]
df = df[df['Latitude'] >= 43]
In [53]:
df.describe()
Out[53]:
NoofBedrooms NoofBathrooms Price Latitude longitude
count 2606.000000 2606.000000 2606.000000 2606.000000 2606.000000
mean 1.375672 1.375672 2485.265579 43.683266 -79.417157
std 0.597891 0.597891 843.512464 0.082851 0.131730
min 1.000000 1.000000 575.000000 43.109820 -81.035639
25% 1.000000 1.000000 2000.000000 43.645480 -79.441142
50% 1.000000 1.000000 2350.000000 43.668003 -79.394570
75% 2.000000 2.000000 2750.000000 43.732525 -79.374160
max 4.500000 4.500000 8000.000000 43.998188 -78.889540
In [54]:
df = df[df['longitude'] > -80]
df = df[df['longitude'] < -79]
In [55]:
df.describe()
Out[55]:
NoofBedrooms NoofBathrooms Price Latitude longitude
count 2595.000000 2595.000000 2595.000000 2595.000000 2595.000000
mean 1.375915 1.375915 2486.211599 43.682774 -79.413447
std 0.598458 0.598458 844.811552 0.081421 0.102906
min 1.000000 1.000000 575.000000 43.144450 -79.904990
25% 1.000000 1.000000 2000.000000 43.645480 -79.440590
50% 1.000000 1.000000 2350.000000 43.667870 -79.394570
75% 2.000000 2.000000 2750.000000 43.730700 -79.374160
max 4.500000 4.500000 8000.000000 43.998188 -79.019540
In [56]:
# df_toronto = df[df['Latitude'] >= 43.713689]
# df_toronto = df_toronto[df_toronto['Latitude'] <= 43.855465]
df = df[df['longitude'] >= -79.63967]
df = df[df['longitude'] <= -79.092422]
df.describe()
Out[56]:
NoofBedrooms NoofBathrooms Price Latitude longitude
count 2493.000000 2493.000000 2493.000000 2493.000000 2493.000000
mean 1.366426 1.366426 2488.595307 43.687627 -79.400395
std 0.587121 0.587121 847.549323 0.072896 0.079144
min 1.000000 1.000000 575.000000 43.144450 -79.636920
25% 1.000000 1.000000 2000.000000 43.646450 -79.428130
50% 1.000000 1.000000 2350.000000 43.668750 -79.392890
75% 2.000000 2.000000 2750.000000 43.733050 -79.371490
max 4.500000 4.500000 8000.000000 43.998188 -79.129740
In [57]:
df[df['Neighborhood'] == 'Rockcliffe-Smythe']
Out[57]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood
1976 Beech Hall Housing Co-Operative 2 Humber Blvd 1.0 1.0 600.0 43.68286 -79.4808 Rockcliffe-Smythe
1977 Beech Hall Housing Co-Operative 2 Humber Blvd 1.0 1.0 812.0 43.68286 -79.4808 Rockcliffe-Smythe

One-Hot Encoding

In [58]:
# one hot encoding
df_onehot = pd.get_dummies(df[['NoofBedrooms','NoofBathrooms','Price']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighborhood'] = df['Neighborhood'] 

# # move neighborhood column to the first column
# fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
# manhattan_onehot = manhattan_onehot[fixed_columns]

df_onehot = df_onehot.groupby('Neighborhood').mean().reset_index()
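One caveat worth noting: `pd.get_dummies` only encodes object/categorical columns, so on these all-numeric columns it is a pass-through, and the real work above is done by `groupby('Neighborhood').mean()`. A tiny sketch with made-up rows illustrates both points:

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["Annex", "Annex", "Mimico"],
    "NoofBedrooms": [1.0, 2.0, 1.0],
    "Price": [2800.0, 3200.0, 2000.0],
})

# get_dummies on all-numeric columns returns them unchanged...
assert pd.get_dummies(toy[["NoofBedrooms", "Price"]]).equals(toy[["NoofBedrooms", "Price"]])

# ...so the grouping step is what actually builds the per-neighborhood profile.
profile = toy.groupby("Neighborhood").mean().reset_index()
print(profile)
```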

Creating a new dataframe with the necessary data and the average price.

In [59]:
# df_new = df[['Neighborhood','Price','Latitude','longitude','NoofBedrooms','NoofBathrooms']]
# df_new1 = df_new.groupby(['Neighborhood','NoofBedrooms','NoofBathrooms']).mean().reset_index()
# # df_new1 = df_new1.sort_values(ascending=True,by='Price').reset_index()
# # df_new1 = df_new1.drop('index',1)
# df_new1
In [60]:
# for lat,lon, street, apt in zip(df_toronto['Latitude'], df_toronto['longitude'], df_toronto['Streetname'], df_toronto['ApartmentName']):
#     #print("Borough - {}, Neighbourhood - {}, latitude - {}, longitude - {}".format(borough,neighbourhood,lat,lon))
#     try:
#         label = folium.Popup(apt,parse_html=True)

#         folium.CircleMarker(
#             location=[lat,lon],
#             color = 'blue',
#             radius = 3,
#             popup=label,
#             parse_html = True,clustered_marker = True).add_to(toronto_map)
#     except:
#         pass
    
# #toronto_map.fit_bounds([[43.749909, -79.639678],[43.581028, -79.542861],[43.855465, -79.170700],[43.713689, -79.092422]])
# display(toronto_map)

KMeans

In [61]:
from sklearn.cluster import KMeans
clust = 10
df_toronto_cluster = df_onehot.drop('Neighborhood', axis=1)
kcluster = KMeans(n_clusters=clust, random_state=0).fit(df_toronto_cluster)
clusters = kcluster.labels_
len(clusters)
Out[61]:
153
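The choice of `clust = 10` above is arbitrary. A common sanity check is the elbow method: fit KMeans for a range of k and watch where the inertia (within-cluster sum of squares) stops dropping sharply. A sketch on synthetic data (two well-separated blobs standing in for `df_toronto_cluster`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the real feature matrix: two tight 3-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 3)), rng.normal(5, 0.2, (50, 3))])

inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 1))

# The drop from k=1 to k=2 dwarfs the later drops — the "elbow" is at k=2.
```

On the real neighborhood profiles the curve would be less clean, but the same plot of k versus inertia is a reasonable way to justify (or revise) k = 10.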
In [62]:
df_onehot
Out[62]:
Neighborhood NoofBedrooms NoofBathrooms Price
0 Agincourt North 3.000000 3.000000 2000.000000
1 Agincourt South-Malvern West 1.441176 1.441176 2024.058824
2 Alderwood 1.250000 1.250000 2712.500000
3 Annex 1.328947 1.328947 2960.184211
4 Banbury-Don Mills 1.444444 1.444444 2447.555556
5 Bathurst Manor 2.142857 2.142857 2764.285714
6 Bay Street Corridor 1.337500 1.337500 3011.312500
7 Bayview Village 1.328125 1.328125 2263.437500
8 Bayview Woods-Steeles 1.416667 1.416667 2380.208333
9 Bedford Park-Nortown 1.250000 1.250000 2533.250000
10 Beechborough-Greenbrook 1.500000 1.500000 2206.000000
11 Bendale 1.263158 1.263158 2151.315789
12 Birchcliffe-Cliffside 1.428571 1.428571 2197.857143
13 Blake-Jones 2.000000 2.000000 3999.500000
14 Briar Hill-Belgravia 1.200000 1.200000 2230.000000
15 Broadview North 1.000000 1.000000 1950.000000
16 Cabbagetown-South St James Town 1.100000 1.100000 1982.400000
17 Casa Loma 1.363636 1.363636 2500.000000
18 Church-Yonge Corridor 1.195455 1.195455 2491.481818
19 City Centre 1.500000 1.500000 2066.666667
20 Clairlea-Birchmount 1.266667 1.266667 2266.000000
21 Clanton Park 1.133333 1.133333 1862.733333
22 Cliffcrest 2.000000 2.000000 2399.000000
23 Concord 1.000000 1.000000 2075.000000
24 Cooksville 1.525000 1.525000 2446.750000
25 Corso Italia-Davenport 1.416667 1.416667 2183.333333
26 Crescent Town 1.000000 1.000000 2100.000000
27 Danforth Village-Toronto 1.750000 1.750000 2625.000000
28 Dixie 1.266667 1.266667 2292.266667
29 Don Valley Village 1.350000 1.350000 2220.700000
... ... ... ... ...
123 Stonegate-Queensway 1.812500 1.812500 3175.000000
124 Tam O'Shanter-Sullivan 1.250000 1.250000 2112.500000
125 The Beaches 1.421053 1.421053 2690.263158
126 Thistletown-Beaumond Heights 1.250000 1.250000 1600.000000
127 Thorncliffe Park 1.295455 1.295455 1709.545455
128 Thornhill 1.428571 1.428571 2285.714286
129 Toronto 1.400000 1.400000 3009.800000
130 Trinity-Bellwoods 1.500000 1.500000 2409.444444
131 Unionville 1.250000 1.250000 2029.166667
132 University 1.500000 1.500000 2158.333333
133 Victoria Village 1.142857 1.142857 2013.571429
134 Waterfront Communities-the Island 1.286972 1.286972 2906.024648
135 West Hill 1.178571 1.178571 1800.000000
136 West Humber-Clairville 1.500000 1.500000 2075.000000
137 Westminster-Branson 1.458333 1.458333 2343.791667
138 Weston 1.000000 1.000000 1404.600000
139 Weston-Pellam Park 1.125000 1.125000 2525.000000
140 Wexford-Maryvale 1.375000 1.375000 2054.375000
141 Willowdale East 1.623077 1.623077 2606.230769
142 Willowdale West 1.566667 1.566667 2672.833333
143 Willowridge-Martingrove-Richview 1.200000 1.200000 2040.000000
144 Woburn 1.428571 1.428571 2128.285714
145 Woodbine Corridor 1.000000 1.000000 2131.666667
146 Woodbridge 3.000000 3.000000 2500.000000
147 Wychwood 1.200000 1.200000 2381.800000
148 Yonge-Eglinton 1.000000 1.000000 2324.750000
149 Yonge-St Clair 1.125000 1.125000 2386.875000
150 York University Heights 1.250000 1.250000 2284.125000
151 York-Haig 1.272727 1.272727 3104.545455
152 Yorkdale-Glen Park 1.384615 1.384615 2199.615385

153 rows × 4 columns

Inserting the cluster label column at the second position.

In [63]:
df_onehot.insert(1,'Cluster',kcluster.labels_)
In [64]:
df_new1 = df_onehot.rename(columns={'Price':'AvgPrice','NoofBedrooms':'AvgBedrooms','NoofBathrooms':'AvgBathrooms'})
#df_new1 = df_new1.drop(['Latitude','longitude','NoofBedrooms','NoofBathrooms'],axis = 1)
In [65]:
df_new1.head()
Out[65]:
Neighborhood Cluster AvgBedrooms AvgBathrooms AvgPrice
0 Agincourt North 6 3.000000 3.000000 2000.000000
1 Agincourt South-Malvern West 6 1.441176 1.441176 2024.058824
2 Alderwood 5 1.250000 1.250000 2712.500000
3 Annex 3 1.328947 1.328947 2960.184211
4 Banbury-Don Mills 8 1.444444 1.444444 2447.555556

Merging the cluster summary with the original dataframe and redirecting the output to a new one.

In [66]:
df_copy = df
df_copy = df_copy.join(df_new1.set_index('Neighborhood'),on='Neighborhood')
df_copy.head()
Out[66]:
ApartmentName Streetname NoofBedrooms NoofBathrooms Price Latitude longitude Neighborhood Cluster AvgBedrooms AvgBathrooms AvgPrice
0 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 2000.0 43.66018 -79.50995 Kingsway South 3 1.666667 1.666667 2975.000000
1 571 Prince Edward Dr N 571 Prince Edward Dr N 1.0 1.0 2300.0 43.66018 -79.50995 Kingsway South 3 1.666667 1.666667 2975.000000
2 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 2700.0 43.66018 -79.50995 Kingsway South 3 1.666667 1.666667 2975.000000
3 571 Prince Edward Dr N 571 Prince Edward Dr N 2.0 2.0 3850.0 43.66018 -79.50995 Kingsway South 3 1.666667 1.666667 2975.000000
4 eCentral 15 Roehampton Ave 1.0 1.0 2002.5 43.70794 -79.39786 Mt Pleasant West 8 1.209677 1.209677 2380.431183

Mapping the data for better visualization.

In [67]:
# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
x = np.arange(clust)
ys = [i + x + (i*x)**2 for i in range(clust)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, price, beds, baths in zip(df_copy['Latitude'], df_copy['longitude'], df_copy['Neighborhood'], df_copy['Cluster'], df_copy['AvgPrice'],df_copy['AvgBedrooms'],df_copy['AvgBedrooms']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster) + ' AvgPrice ' + str(price) + ' with avg beds ' + str(beds) + ' and avg bath ' + str(baths) ,parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels start at 0, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(toronto_map)
    
display(toronto_map)

Now we will go through each cluster (or you can click the cluster points on the map above). In the tables below, focus on the ranges rather than the raw numbers: the minimum and maximum price for each cluster are there, and so is the location range. Explore at will.
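Instead of eyeballing each `describe()` table below, the same per-cluster price ranges can be pulled in one pass with a grouped aggregation. A sketch on a hypothetical miniature of `df_copy`:

```python
import pandas as pd

# Made-up miniature of df_copy, just to show the aggregation pattern.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "Price": [1800.0, 2200.0, 3000.0, 3400.0, 2600.0],
})

summary = toy.groupby("Cluster")["Price"].agg(["min", "max", "mean", "count"])
print(summary)
```

On the real frame, `df_copy.groupby('Cluster')['Price'].agg(['min', 'max', 'mean', 'count'])` would give one compact table for all ten clusters.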

Color:-- Red

In [68]:
df_copy[df_copy['Cluster'] == 0].describe()
Out[68]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 429.000000 429.000000 429.000000 429.000000 429.000000 429.0 429.000000 429.000000 429.000000
mean 1.414918 1.414918 2245.275058 43.715326 -79.433446 0.0 1.414918 1.414918 2245.275058
std 0.648767 0.648767 586.668604 0.075970 0.098465 0.0 0.221406 0.221406 54.857584
min 1.000000 1.000000 650.000000 43.512970 -79.636920 0.0 1.000000 1.000000 2151.315789
25% 1.000000 1.000000 1950.000000 43.646450 -79.525750 0.0 1.283333 1.283333 2194.615385
50% 1.000000 1.000000 2250.000000 43.716250 -79.442830 0.0 1.369565 1.369565 2263.437500
75% 2.000000 2.000000 2500.000000 43.773470 -79.349040 0.0 1.483516 1.483516 2286.054945
max 4.500000 4.500000 6500.000000 43.960270 -79.208440 0.0 3.000000 3.000000 2350.000000

Color:-- Purple

In [69]:
df_copy[df_copy['Cluster'] == 1].describe()
Out[69]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 111.000000 111.000000 111.000000 111.000000 111.000000 111.0 111.000000 111.000000 111.000000
mean 1.234234 1.234234 1787.819820 43.750689 -79.323465 1.0 1.234234 1.234234 1787.819820
std 0.512657 0.512657 482.147416 0.050072 0.115873 0.0 0.228717 0.228717 57.025655
min 1.000000 1.000000 650.000000 43.582590 -79.572620 1.0 1.000000 1.000000 1709.545455
25% 1.000000 1.000000 1600.000000 43.701970 -79.434970 1.0 1.100000 1.100000 1722.916667
50% 1.000000 1.000000 1750.000000 43.750300 -79.342300 1.0 1.178571 1.178571 1800.000000
75% 1.000000 1.000000 2013.500000 43.793610 -79.219060 1.0 1.295455 1.295455 1839.375000
max 4.000000 4.000000 3000.000000 43.849800 -79.132320 1.0 2.250000 2.250000 1862.733333

Color:-- Red

In [71]:
df_copy[df_copy['Cluster'] == 2].describe()
Out[71]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 20.000000 20.000000 20.000000 20.000000 20.000000 20.0 20.000000 20.000000 20.000000
mean 1.825000 1.825000 4024.100000 43.606115 -79.360316 2.0 1.825000 1.825000 4024.100000
std 0.907208 0.907208 1159.016866 0.157061 0.101525 0.0 0.349812 0.349812 174.730288
min 1.000000 1.000000 2214.000000 43.158420 -79.584520 2.0 1.625000 1.625000 3907.000000
25% 1.000000 1.000000 3500.000000 43.665097 -79.336813 2.0 1.625000 1.625000 3960.000000
50% 1.250000 1.250000 3900.000000 43.666390 -79.319610 2.0 1.625000 1.625000 3960.000000
75% 2.625000 2.625000 4225.000000 43.667170 -79.317130 2.0 2.000000 2.000000 3999.500000
max 3.000000 3.000000 6750.000000 43.679870 -79.248140 2.0 2.750000 2.750000 4524.500000

Color:-- Blue

In [72]:
df_copy[df_copy['Cluster'] == 3].describe()
Out[72]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 454.000000 454.000000 454.000000 454.000000 454.000000 454.0 454.000000 454.000000 454.000000
mean 1.325991 1.325991 2949.914097 43.639886 -79.394489 3.0 1.325991 1.325991 2949.914097
std 0.504403 0.504403 996.108525 0.076591 0.042267 0.0 0.108929 0.108929 70.252955
min 1.000000 1.000000 1150.000000 43.161370 -79.595260 3.0 1.250000 1.250000 2906.000000
25% 1.000000 1.000000 2300.000000 43.642750 -79.395915 3.0 1.286972 1.286972 2906.024648
50% 1.000000 1.000000 2600.000000 43.646160 -79.390820 3.0 1.286972 1.286972 2906.024648
75% 2.000000 2.000000 3400.000000 43.659660 -79.384655 3.0 1.328947 1.328947 3011.312500
max 3.000000 3.000000 8000.000000 43.757650 -79.252830 3.0 2.000000 2.000000 3194.166667

Color:-- Light Blue

In [73]:
df_copy[df_copy['Cluster'] == 4].describe()
Out[73]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 8.000000 8.000000 8.000000 8.000000 8.000000 8.0 8.000000 8.000000 8.000000
mean 1.125000 1.125000 948.375000 43.757511 -79.314731 4.0 1.125000 1.125000 948.375000
std 0.353553 0.353553 422.517603 0.061964 0.149089 0.0 0.172516 0.172516 154.538888
min 1.000000 1.000000 575.000000 43.682860 -79.504040 4.0 1.000000 1.000000 706.000000
25% 1.000000 1.000000 675.000000 43.686737 -79.480800 4.0 1.000000 1.000000 889.000000
50% 1.000000 1.000000 806.000000 43.785720 -79.265060 4.0 1.000000 1.000000 1025.000000
75% 1.000000 1.000000 1050.000000 43.799903 -79.182823 4.0 1.333333 1.333333 1037.500000
max 2.000000 2.000000 1800.000000 43.821270 -79.162500 4.0 1.333333 1.333333 1075.000000

Color:-- Light Green

In [74]:
df_copy[df_copy['Cluster'] == 5].describe()
Out[74]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 466.000000 466.000000 466.000000 466.000000 466.000000 466.0 466.000000 466.000000 466.000000
mean 1.492489 1.492489 2674.770386 43.695374 -79.402081 5.0 1.492489 1.492489 2674.770386
std 0.662715 0.662715 887.426193 0.049778 0.044912 0.0 0.210572 0.210572 46.924763
min 1.000000 1.000000 650.000000 43.598900 -79.580740 5.0 1.250000 1.250000 2599.090909
25% 1.000000 1.000000 2100.000000 43.657595 -79.414708 5.0 1.305556 1.305556 2646.527778
50% 1.000000 1.000000 2572.500000 43.671160 -79.399230 5.0 1.528736 1.528736 2677.218391
75% 2.000000 2.000000 3168.750000 43.756500 -79.376990 5.0 1.566667 1.566667 2693.154025
max 4.500000 4.500000 7500.000000 43.861959 -79.281300 5.0 3.000000 3.000000 2769.700000

Color:-- Light Dark Green

In [75]:
df_copy[df_copy['Cluster'] == 6].describe()
Out[75]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 240.000000 240.000000 240.000000 240.000000 240.000000 240.0 240.000000 240.000000 240.000000
mean 1.297917 1.297917 2042.266667 43.744394 -79.351329 6.0 1.297917 1.297917 2042.266667
std 0.563718 0.563718 543.929992 0.057269 0.103195 0.0 0.241755 0.241755 46.265996
min 1.000000 1.000000 600.000000 43.595020 -79.635140 6.0 1.000000 1.000000 1937.500000
25% 1.000000 1.000000 1799.000000 43.709460 -79.408810 6.0 1.095238 1.095238 2000.000000
50% 1.000000 1.000000 2000.000000 43.742725 -79.327785 6.0 1.304348 1.304348 2042.523810
75% 1.500000 1.500000 2315.000000 43.780080 -79.281170 6.0 1.441176 1.441176 2075.361111
max 4.000000 4.000000 4500.000000 43.905540 -79.129740 6.0 3.000000 3.000000 2132.666667

Color:-- Autumn Leaf green

In [76]:
df_copy[df_copy['Cluster'] == 7].describe()
Out[76]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 23.000000 23.000000 23.000000 23.000000 23.000000 23.0 23.000000 23.000000 23.000000
mean 1.695652 1.695652 3390.434783 43.615222 -79.438381 7.0 1.695652 1.695652 3390.434783
std 0.686545 0.686545 1325.836711 0.167629 0.100593 0.0 0.338416 0.338416 41.050035
min 1.000000 1.000000 890.000000 43.144450 -79.591420 7.0 1.250000 1.250000 3300.000000
25% 1.000000 1.000000 2672.500000 43.635545 -79.471108 7.0 1.566667 1.566667 3375.000000
50% 1.500000 1.500000 3400.000000 43.658200 -79.421320 7.0 1.566667 1.566667 3405.333333
75% 2.000000 2.000000 4197.500000 43.659765 -79.415485 7.0 1.658333 1.658333 3405.333333
max 3.000000 3.000000 6500.000000 43.998188 -79.200500 7.0 2.500000 2.500000 3500.000000

Color:-- Orange

In [77]:
df_copy[df_copy['Cluster'] == 8].describe()
Out[77]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 713.000000 713.000000 713.000000 713.000000 713.000000 713.0 713.000000 713.000000 713.000000
mean 1.310659 1.310659 2463.862693 43.669072 -79.413709 8.0 1.310659 1.310659 2463.862693
std 0.530580 0.530580 652.288359 0.046544 0.061536 0.0 0.154945 0.154945 50.502988
min 1.000000 1.000000 800.000000 43.560920 -79.636270 8.0 1.000000 1.000000 2380.208333
25% 1.000000 1.000000 2100.000000 43.639690 -79.425940 8.0 1.209677 1.209677 2434.487179
50% 1.000000 1.000000 2350.000000 43.656460 -79.397860 8.0 1.244792 1.244792 2486.842105
75% 2.000000 2.000000 2750.000000 43.688200 -79.379610 8.0 1.410256 1.410256 2505.405405
max 4.500000 4.500000 6975.000000 43.866710 -79.227570 8.0 3.000000 3.000000 2568.750000
In [78]:
df_copy[df_copy['Cluster'] == 9].describe()
Out[78]:
NoofBedrooms NoofBathrooms Price Latitude longitude Cluster AvgBedrooms AvgBathrooms AvgPrice
count 29.000000 29.000000 29.000000 29.000000 29.000000 29.0 29.000000 29.000000 29.000000
mean 1.189655 1.189655 1509.172414 43.740192 -79.371178 9.0 1.189655 1.189655 1509.172414
std 0.410004 0.410004 562.516290 0.027818 0.123140 0.0 0.220644 0.220644 87.583239
min 1.000000 1.000000 700.000000 43.699420 -79.555460 9.0 1.000000 1.000000 1404.600000
25% 1.000000 1.000000 962.000000 43.725570 -79.512870 9.0 1.000000 1.000000 1425.000000
50% 1.000000 1.000000 1544.000000 43.734900 -79.289490 9.0 1.062500 1.062500 1468.000000
75% 1.000000 1.000000 2000.000000 43.763590 -79.260690 9.0 1.375000 1.375000 1612.375000
max 2.500000 2.500000 2450.000000 43.810780 -79.251720 9.0 1.750000 1.750000 1612.375000

Results and Discussions

We have now analyzed the data and found groups of similar neighborhoods based on house-pricing features. Each cluster represents a set of neighborhoods.

After examining the clusters on the map, we can say that most of them are located in the heart of Toronto, in high-density areas. Some clusters lie in the outer regions, but they have rail and subway networks keeping them connected to the main city. The final choice of house depends on the person's work, the rent he/she can afford, how many people will live there, and whether the room will be shared or single.

Conclusion

Here we have built clusters based on the average price of a neighborhood; the final selection depends entirely on the user's requirements and the pricing of the house, along with the location of the job or school he/she is enrolled in.

This could be taken one step further by involving nearby venues in the clustering process, to better separate closely lying neighborhoods. Also, if a user requirement is, say, a school within 2 km or a grocery store within walking distance of the house, that filtering should be applied after the clustering is complete. That would provide better final outcomes and an overall model that can fit any user's requirements.
